In [1]:
!pip install opencv-python
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: opencv-python in /root/.local/lib/python3.7/site-packages (4.2.0.34)
Requirement already satisfied: numpy>=1.14.5 in /opt/conda/lib/python3.7/site-packages (from opencv-python) (1.18.1)
In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from glob import glob
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
##Import any other packages you may need here
import cv2
import pydicom
import scipy.stats

EDA is open-ended, and it is up to you to decide how to look at different ways to slice and dice your data. A good starting point is to look at the requirements for the FDA documentation in the final part of this project to guide (some) of the analyses you do.

This EDA should also help to inform you of how pneumonia looks in the wild. E.g. what other types of diseases it's commonly found with, how often it is found, what ages it affects, etc.

Note that this NIH dataset was not specifically acquired for pneumonia. So, while this is a representation of 'pneumonia in the wild,' the prevalence of pneumonia may be different if you were to take only chest x-rays that were acquired in an ER setting with suspicion of pneumonia.

Perform the following EDA:

  • The patient demographic data such as gender, age, patient position,etc. (as it is available)
  • The x-ray views taken (i.e. view position)
  • The number of cases including:
    • number of pneumonia cases,
    • number of non-pneumonia cases
  • The distribution of other diseases that are comorbid with pneumonia
  • Number of disease per patient
  • Pixel-level assessments of the imaging data for healthy & disease states of interest (e.g. histograms of intensity values) and compare distributions across diseases.

Note: use full NIH data to perform the first a few EDA items and use sample_labels.csv for the pixel-level assassements.

Also, describe your findings and how will you set up the model training based on the findings.

Q.1 What type of images were used in this dataset ?

Ans - GrayScale

Q.2 What body parts were examined in this dataset ?

Ans - Chest

Q.3 What was the age range of patients that were tested positive with Pneumonia?

Ans - Age Between 0 - 100

Q.4 What type of image positions were present in this dataset ?

Ans - AP and PA

Q.5 What was the total count of patients with penumonia in this dataset ?

Ans - 1431

Q.6 What type of diseases are most common comorbidities of pneumonia ?

Ans - The top 5 most common co occurring comorbidities with pneumonia present in this dataset are:-

    1. Infiltration with Pneumonia
    2. Edema, Infiltration with Pneumonia
    3. Atelectasis with Pneumonia
    4. Edema with Pneumonia
    5. Effusion with Pneumonia

Based on the frequency these are the diseases that most commonly occur with pneumonia:-

    1. Infiltration
    2. Edema
    3. Effusion
    4. Atelectasis

Q.7 Were you able to spot any differences in pixel intensities with patients which have pneumonia and patients which are healthy ? Ans - On examining the pixel intensities of 3 differrent patients that were healthy(No Finding):- Here the first value represents the image mean while the second value reprsents image standard deviation.

 1. 131.74216710869442, 51.71516257188659
 2. 169.66516806341969, 51.25696493442237
 3. 151.1333099500918, 53.89837944789393

While the pixel intensities of 3 differrent patients that were tested positive with pneumonia:- Here the first value represents the image mean while the second value reprsents image standard deviation.

 1. 123.59639087519794, 40.048603370246795
 2. 129.08860275952935, 50.911206559341856
 3. 151.73789234004076, 54.98203429309973

Conclusion:- When comparing the above result we see that there is not much difference between the mean and standard deviation of a image in healthy patient and a patient that was tested positive with Pneumonia.

Q.8 What was the difference between pixel intensities of Pneumonia versus all the other diseases, what conclusion can be drawn from examing them ?

Note:- Kindly to see the results that are reffered to in this explanation look at the list of means and standard deviation provided in the last block.

Ans - As we can from the results the presence of other diseases in the chest X ray Scan may lower the accuracy of our model as the mean and standard deviation of the diseases are very close to each other. While the mean of the disease are very close to each other Hernia, Nodule and Atelectasis have a slighly higher mean in some of the images. Also the thing to notice is that the mean of the labels Infiltrationa and Edema are almost same as mean of the images which contain Pneumonia this may lower the accuracy of the model if these images are present in the dataset.

Conclusion:- While the presence of other images in the chest X ray can lower the accuracy of model the most difficult ones to differentitate are Infiltration and Edema as they have the same image mean as that of pimages that have pneumonia.

In [3]:
## Below is some helper code to read data for you.
## Load NIH data
all_xray_df = pd.read_csv('/data/Data_Entry_2017.csv')
all_xray_df.dropna(axis=1, how='all')

all_xray_df.dropna(axis=1, inplace=True)
all_image_paths = {os.path.basename(x): x for x in glob(os.path.join('/data','images*/images', '', '*.png'))}

print('Scans found:', len(all_image_paths), ', Total Headers', all_xray_df.shape[0])
all_xray_df['Image Path'] = all_xray_df['Image Index'].map(all_image_paths.get)

all_xray_df.sample(3)
Scans found: 112120 , Total Headers 112120
Out[3]:
Image Index Finding Labels Follow-up # Patient ID Patient Age Patient Gender View Position OriginalImage[Width Height] OriginalImagePixelSpacing[x y] Image Path
15296 00004006_017.png No Finding 17 4006 33 F PA 2992 2991 0.143 0.143 /data/images_003/images/00004006_017.png
76337 00018732_000.png No Finding 0 18732 44 F AP 2500 2048 0.168 0.168 /data/images_009/images/00018732_000.png
8898 00002345_010.png Infiltration|Nodule|Pleural_Thickening 10 2345 57 F PA 2544 3056 0.139 0.139 /data/images_002/images/00002345_010.png
In [4]:
## EDA
# Demographic data for Patient Gender
plt.figure(figsize=(6,6))
all_xray_df['Patient Gender'].value_counts().plot(kind='bar') 
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff6c2d4dd10>
In [5]:
# Demographic data for Patient Age
plt.figure(figsize=(6,6))
plt.xlim(0, 150)
plt.hist(all_xray_df['Patient Age'])
Out[5]:
(array([4.1465e+04, 7.0265e+04, 3.7400e+02, 1.0000e+01, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 6.0000e+00]),
 array([  1. ,  42.3,  83.6, 124.9, 166.2, 207.5, 248.8, 290.1, 331.4,
        372.7, 414. ]),
 <a list of 10 Patch objects>)
In [6]:
# Demographic data for Patient Position
plt.figure(figsize=(6,6))
all_xray_df['View Position'].value_counts().plot(kind='bar')
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff6c2cac850>
In [7]:
#Sample X ray Images from the dataset
j = 1
plt.figure(figsize=(16,16))
for i in all_xray_df.sample(3)["Image Path"]:
    plt.subplot(1,3,j)
    j=j+1
    plt.imshow(cv2.imread(i))
In [8]:
#Pneumonia Count
pneumonia_count = len(all_xray_df[all_xray_df['Finding Labels'].str.contains("Pneumonia")])
non_pneumonia_count = len(all_xray_df['Finding Labels'])- pneumonia_count
print("The total number of Pnemounia Cases in the dataset are:- " + str(pneumonia_count))
print("The total number of Non Pnemounia Cases in the dataset are:- " + str(non_pneumonia_count))
The total number of Pnemounia Cases in the dataset are:- 1431
The total number of Non Pnemounia Cases in the dataset are:- 110689
In [9]:
# Finding Unique Label Names
from itertools import chain

all_labels = np.unique(list(chain(*all_xray_df['Finding Labels'].map(lambda x: x.split('|')).tolist())))
all_labels = [x for x in all_labels if len(x)>0]
print('All Labels ({}): {}'.format(len(all_labels), all_labels))
for c_label in all_labels:
    if len(c_label)>1: # leave out empty labels
        all_xray_df[c_label] = all_xray_df['Finding Labels'].map(lambda finding: 1.0 if c_label in finding else 0)
all_xray_df.sample(3)
All Labels (15): ['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema', 'Effusion', 'Emphysema', 'Fibrosis', 'Hernia', 'Infiltration', 'Mass', 'No Finding', 'Nodule', 'Pleural_Thickening', 'Pneumonia', 'Pneumothorax']
Out[9]:
Image Index Finding Labels Follow-up # Patient ID Patient Age Patient Gender View Position OriginalImage[Width Height] OriginalImagePixelSpacing[x ... Emphysema Fibrosis Hernia Infiltration Mass No Finding Nodule Pleural_Thickening Pneumonia Pneumothorax
2951 00000785_001.png Nodule 1 785 38 F AP 3056 2544 0.139 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
21814 00005778_035.png Atelectasis 35 5778 66 M AP 3056 2544 0.139 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
30076 00007846_000.png No Finding 0 7846 47 M PA 2500 2048 0.168 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

3 rows × 27 columns

In [10]:
all_xray_df[all_labels].sum()/len(all_xray_df)
Out[10]:
Atelectasis           0.103095
Cardiomegaly          0.024759
Consolidation         0.041625
Edema                 0.020540
Effusion              0.118775
Emphysema             0.022440
Fibrosis              0.015037
Hernia                0.002025
Infiltration          0.177435
Mass                  0.051570
No Finding            0.538361
Nodule                0.056466
Pleural_Thickening    0.030191
Pneumonia             0.012763
Pneumothorax          0.047289
dtype: float64
In [11]:
plt.figure(figsize=(16,6))
ax = all_xray_df[all_labels].sum().plot(kind='bar')
ax.set(ylabel = 'Number of Images with Label')
Out[11]:
[Text(0, 0.5, 'Number of Images with Label')]
In [12]:
## Co occurring Diseases
plt.figure(figsize=(16,6))
all_xray_df[all_xray_df.Pneumonia==1]['Finding Labels'].value_counts().plot(kind='bar')
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff6c1b72990>
In [13]:
##Since there are many combinations of potential findings, We will be looking at the 30 most common co-occurrences:
plt.figure(figsize=(16,6))
all_xray_df[all_xray_df.Pneumonia==1]['Finding Labels'].value_counts()[0:30].plot(kind='bar')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff6c1b21e10>
In [15]:
##Frequency Distribution Label wise of Each disease that co occurr with Pneumonia

diseases_freq ={}
for i in all_labels:
    if i == "No Finding":
        continue
    diseases_freq[i] = len(all_xray_df[all_xray_df.Pneumonia==1][all_xray_df[i]==1])
    
diseases_freq = {k: v for k, v in sorted(diseases_freq.items(), key=lambda item: item[1], reverse=True)}

plt.figure(figsize=(16,6))
plt.bar(range(len(diseases_freq)), list(diseases_freq.values()), align='center')
plt.xticks(range(len(diseases_freq)), list(diseases_freq.keys()), rotation= 45)
plt.title("Frequency Distribution Label wise of Each disease that co occurr with Pneumonia")
plt.show()
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:7: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  import sys
In [16]:
#Age distribution for Pneumonia
plt.figure()
plt.hist([all_xray_df[all_xray_df["Pneumonia"]==1]['Patient Age'].values], bins = 10, range=[0, 120])
Out[16]:
(array([ 46., 152., 270., 272., 358., 274.,  54.,   4.,   0.,   0.]),
 array([  0.,  12.,  24.,  36.,  48.,  60.,  72.,  84.,  96., 108., 120.]),
 <a list of 10 Patch objects>)
In [17]:
#Gender distribution for Pneumonia
plt.figure()
all_xray_df[all_xray_df.Pneumonia==1]['Patient Gender'].value_counts().plot(kind='bar')
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff6c1e65710>
In [18]:
#Number of disease per patient
count_list = [(all_xray_df.iloc[i][all_labels].sum()) for i in range(0, len(all_xray_df))]
    
idx = pd.Index(count_list, name ='disease_count').astype('int64')  
idx.value_counts()      
Out[18]:
1    91324
2    14306
3     4856
4     1247
5      301
6       67
7       16
9        2
8        1
dtype: int64
In [19]:
plt.figure(figsize=(16,6))
plt.xlabel('Number Of Diseases', fontsize=14)
plt.ylabel('Number Of People', fontsize=14)
idx.sort_values().value_counts().sort_index().plot(kind='bar')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff6c1d53590>
In [20]:
## Load 'sample_labels.csv' data for pixel level assessments
sample_df = pd.read_csv('sample_labels.csv')
sample_df['Image Path'] = sample_df['Image Index'].map(all_image_paths.get)
sample_df.sample(3)
Out[20]:
Image Index Finding Labels Follow-up # Patient ID Patient Age Patient Gender View Position OriginalImageWidth OriginalImageHeight OriginalImagePixelSpacing_x OriginalImagePixelSpacing_y Image Path
2523 00012671_002.png No Finding 2 12671 069Y F AP 2500 2048 0.168 0.168 /data/images_006/images/00012671_002.png
2316 00011804_012.png Effusion|Mass|Pleural_Thickening|Pneumothorax 12 11804 057Y M AP 2500 2048 0.168 0.168 /data/images_006/images/00011804_012.png
291 00001558_005.png Atelectasis|Consolidation|Effusion 5 1558 055Y M PA 2992 2991 0.143 0.143 /data/images_002/images/00001558_005.png
In [21]:
def show_label_images(label):
    j = 1
    img_list = []
    plt.figure(figsize=(16,16))
    for i in sample_df[sample_df["Finding Labels"] == label].sample(3)["Image Path"]:
        img_list.append(i)
        plt.subplot(1,3,j)
        j=j+1
        plt.imshow(cv2.imread(i))
        plt.title(label + " Image " + str(j-1))
    return img_list
In [22]:
def n_pixel_intensity(img_list ,label):
    j = 1
    plt.figure(figsize=(16,5))
    
    mean_std_list = []
    for i in img_list:
        plt.subplot(1,3,j)
        j=j+1
        img = cv2.imread(i)
        img_mask = (img > 50)
        img = img[img_mask]

        mean_intensity = np.mean(img)
        std_intensity = np.std(img)
        new_img = img.copy()
        new_img = (new_img - mean_intensity)/std_intensity

        mean_std_list.append([mean_intensity, std_intensity])
        
        plt.hist(new_img.ravel(), bins=256)
        plt.title(label + " (Pixel intensities)" + " Image " + str(j-1) )
        
    return mean_std_list
In [23]:
# Some samples of Pixel level intensities for the labels present in the Data
mean_std_list_all = []
for i in all_labels:
    img_list = show_label_images(i)
    mean_std_list = n_pixel_intensity(img_list, i)
    mean_std_list_all.append(mean_std_list)
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:4: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  after removing the cwd from sys.path.
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:3: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  This is separate from the ipykernel package so we can avoid doing imports until
In [24]:
# Mean and standard deviation values for the above shown graphs
for i in range (0 , len(all_labels)-1):
    print (all_labels[i])
    print(mean_std_list_all[i][0])
    print(mean_std_list_all[i][1])
    print(mean_std_list_all[i][2])
Atelectasis
[183.4759762825692, 38.8228117167868]
[150.9077449387873, 53.3727605866181]
[164.8735418508101, 43.975288911581224]
Cardiomegaly
[128.83307487266242, 51.115550075020415]
[136.8640644128449, 36.53535959557387]
[151.27434639121267, 49.43427498187874]
Consolidation
[136.25213084058367, 39.549099204060695]
[113.47450125062531, 24.289356878428396]
[129.49875612102687, 41.714859305629915]
Edema
[131.64881847841565, 41.05476870142153]
[138.49302833802707, 35.704884049171085]
[140.423900789177, 42.68560075858166]
Effusion
[135.64835428293796, 42.12175303588151]
[169.2843350674587, 43.03083694780158]
[153.63621400994685, 47.35331560498529]
Emphysema
[144.59129684130394, 49.112260792199514]
[146.77596605825832, 42.77358986132117]
[142.79342620709372, 56.445018552015604]
Fibrosis
[144.55462002996606, 50.902789324082065]
[149.7990996825899, 52.233711778621036]
[127.37757741968895, 38.56289020480486]
Hernia
[149.54189197143197, 55.56731908810213]
[177.59617113790227, 48.689447427204044]
[186.05325064034304, 43.63219369213618]
Infiltration
[135.49848499771156, 55.2401434631041]
[127.53458894223058, 30.906360104018272]
[127.83930699513058, 51.549756752407646]
Mass
[131.88174985808556, 45.21619576914805]
[125.12684054720151, 29.677750093860773]
[165.81112653383136, 33.668576780077075]
No Finding
[131.74216710869442, 51.71516257188659]
[169.66516806341969, 51.25696493442237]
[151.1333099500918, 53.89837944789393]
Nodule
[155.74548456981836, 42.93895299822878]
[185.57509241490462, 43.475136260455365]
[144.527462547723, 30.575544060658075]
Pleural_Thickening
[144.10815154531028, 52.333136894965044]
[135.79311389845566, 50.963504154178956]
[141.2528284195034, 56.94801298093054]
Pneumonia
[123.59639087519794, 40.048603370246795]
[129.08860275952935, 50.911206559341856]
[151.73789234004076, 54.98203429309973]
In [78]:
#Bounding Box Analysis for Later

# bbox = pd.read_csv('/data/BBox_List_2017.csv')
# sample_pn = bbox[bbox["Finding Label"]=="Pneumonia"].sample(1)
# sample_pn
# #full image path for given file name
# import os
# def file_path(img_name):
#     file_loc = {}
#     for dirs,subdirs, files in os.walk('/data/'):
#         for file in files:
#             file_loc[file] = os.path.join(dirs, file)
#     return file_loc[img_name]
# #function for pixel level intensity
# def n_pixel_intensity(img_data, color='blue'):
#     plt.figure(figsize=(5,5))
#     plt.title('Normalized Image Pixel Intensity')

#     img_mask = (img_data > 50)
#     img_data = img_data[img_mask]
#     mean_intensity = np.mean(img_data)
#     std_intensity = np.std(img_data)
#     new_img = img_data.copy()
#     new_img = (new_img - mean_intensity)/std_intensity
#     plt.hist(new_img.ravel(), bins=256, color=color);
#     return mean_intensity, std_intensity

# sample_pn_img = [x for x in sample_pn["Image Index"]][0]
# img_data = cv2.imread(file_path(sample_pn_img))
# plt.imshow(img_data, cmap='gray')

# mean_intensity, std_intensity = n_pixel_intensity(img_data)

# print("mean intensity:", mean_intensity)
# print("std intensity:",std_intensity)

# from PIL import Image
# img = Image.open(file_loc[sample_pn_img])
# temp = bbox[(bbox['Image Index'] == sample_pn_img) & (bbox['Finding Label'] == "Pneumonia")]
# im = img.crop((int(temp['Bbox [x']), int(temp['y']), (int(temp['Bbox [x'])+int(temp['w'])+1), (int(temp['y'])+int(temp['h]']))))
# plt.imshow(im, cmap='gray')

# pix = np.array(im)
# mean_intensity, std_intensity = n_pixel_intensity(pix, color='red')

# print("mean intensity:", mean_intensity)
# print("std intensity:",std_intensity)

# from itertools import chain

# all_labels = np.unique(list(chain(*bbox['Finding Label'].map(lambda x: x.split('|')).tolist())))
# all_labels = [x for x in all_labels if len(x)>0]
# print('All Labels ({}): {}'.format(len(all_labels), all_labels))
# for c_label in all_labels:
#     if len(c_label)>1: # leave out empty labels
#         bbox[c_label] = bbox['Finding Label'].map(lambda finding: 1.0 if c_label in finding else 0)
# bbox.sample(3)
In [ ]: